11/11/2022

glmmTMB

  • Generalized linear mixed models (GLMMs) can be useful for non-normal data with random effects
  • \(\texttt{glmmTMB}\) has a familiar interface based off \(\texttt{lme4}\)
  • \(\texttt{glmmTMB}\) is a fast, flexible and stable package (Brooks et al. 2017)
  • It has more distributions available
  • Plus flexible zero-inflated models and hurdle models

House work

First, let’s load some packages

packages <- c("ggplot2", "dplyr", "tidyr", "glmmTMB", "ecostats", "DHARMa")
install.packages(setdiff(packages, rownames(installed.packages()))) 
invisible(lapply(packages, library, character.only = TRUE))
## Warning: package 'ecostats' was built under R version 4.2.2
source("functions.R")

Wind Farm Example

  • A study in Sweden investigated the effects of an offshore windfarm on the abundance and distribution patterns of benthic fish communities

Wind Farm Example

The wind farm data is in the ecostats package.

wf <- cbind(windFarms$abund, windFarms$X) %>% 
  mutate(ID = 1:nrow(.)) %>% 
  pivot_longer(cols = colnames(windFarms$abund), 
               names_to = "Species",
               values_to = "Abundance")

Wind Farm Example: Study design

head(wf)
## # A tibble: 6 × 7
##   Year  Zone  Station Impact    ID Species     Abundance
##   <fct> <fct> <fct>   <fct>  <int> <chr>           <int>
## 1 2003  WF    8.1     Before     1 Bergvar             0
## 2 2003  WF    8.1     Before     1 Oxsimpa             0
## 3 2003  WF    8.1     Before     1 Piggvar             0
## 4 2003  WF    8.1     Before     1 Roodspotta          0
## 5 2003  WF    8.1     Before     1 Rootsimpa           0
## 6 2003  WF    8.1     Before     1 Sandskaadda         0

If you want more information you can find it in the help file:

?windFarms

Wind Farm Example: Data

  • Keep stations that were observed in both years
  • Remove species that were observed less than 5 times
stat03 <- unique(wf[wf$Year=="2003",]$Station)
stat10 <- unique(wf[wf$Year=="2010",]$Station)
sum_abund <- wf %>%
  filter(Station %in% stat03 ) %>%
  group_by(Species) %>%
  summarise(obs = sum(Abundance > 0) ) %>% 
  ungroup() %>% 
  arrange(-obs) # order from most to least abundant

spp_names <- unique(sum_abund$Species[sum_abund$obs>=5])
wf_data <- wf %>% 
  filter(Species %in% spp_names) %>% 
  mutate(Species = factor(Species, levels = spp_names))

Wind Farm Example: Data

## Boxplot of abundance of species by year and zone
labs=c(0,1,10,100)
wf_boxplot <- wf_data %>% 
  ggplot(aes(x = Year, y = log(Abundance + 1), fill = Zone)) +
  geom_boxplot(outlier.shape = 21)+
  facet_wrap(~ Species, ncol = 3, scales = "free") +
  labs(x = "Year", color  = "Zone", y = "Abundance [log(y+1)-scale]") +
  scale_y_continuous( breaks = log(labs + 1), labels = labs)+
  theme_classic() +
  theme(legend.position="bottom"  ) +
  scale_fill_brewer(palette = "Set2")

Wind Farm Example: Data

wf_boxplot

Wind Farm Example: Why GLMM?

Generalised Linear Mixed Model

A generalised linear mixed model (GLMM) has the following form:

\[ g(\boldsymbol{\mu}) = \boldsymbol{ X \beta} + \boldsymbol{Zb} \] where

  • g(.) is a known link function
  • \(\boldsymbol{X}\) is a \(n \times p\) model matrix
  • \(\boldsymbol{\beta}\) is a \(p\)-dimensional vector of regression coefficients related to the covariates
  • \(\boldsymbol{Z}\) is a \(n \times q\) model matrix for the \(q\)-dimensional random effects
  • \(\boldsymbol{b}\) are multivariate random effects, \(\boldsymbol{b} \sim N(\boldsymbol{0}, \boldsymbol{\Sigma} )\)
  • \(\boldsymbol{y} | \boldsymbol{b} \sim F(\mu, \phi )\), where F is a member of the exponential family, mean \(\boldsymbol{\mu}\), nuisance \(\boldsymbol{\phi}\)

Generalised Linear Mixed Model

Consider the joint model of the mean abundance, \(\mu_{ijk}\), observed at \(i = 1, \ldots, n\) observation, \(k = 1,\ldots, l\) stations, for \(j = 1, \ldots, p\) species as follows:

\[ g(\mu_{ijk}) = \boldsymbol{x_i'\beta} + \boldsymbol{x_i'b_{j}^{[x]}} + {b_{k}^{[s]}} + b_{ij} \]

  • \(\boldsymbol{x_i}\) are vectors of covariates relating to intercept, zone, year and the interaction of zone and year
  • \(\boldsymbol{\beta}\) are vectors of fixed coefficients
  • \(\boldsymbol{b_{j}^{[x]}} \sim N(\boldsymbol{0}, \boldsymbol{\sigma_x^2I})\)
  • \({b_{k}^{[s]}} \sim N(0, \sigma^2_s)\)
  • \(\boldsymbol{b_{i}} \sim N(\boldsymbol{0}, \Sigma_{sp})\)

GLMM: wind farm

We assume no prior structure about the correlation across species, i.e., unstructured covariance matrix.

fit.us <- glmmTMB(Abundance ~ Year*Zone + diag(Year*Zone|Species) + 
                 (1 | Station) +
                 (Species + 0|ID),
               wf_data, family = poisson)
## Warning in fitTMB(TMBStruc): Model convergence
## problem; non-positive-definite Hessian matrix. See
## vignette('troubleshooting')
## Warning in fitTMB(TMBStruc): Model convergence problem;
## false convergence (8). See vignette('troubleshooting')

GLMM: reduced-rank random effects

A reduced-rank approach to fit a multivariate random effect

\[b_{ij} = \boldsymbol{\lambda}_j \boldsymbol{u}_{i}\]

where

  • \(\boldsymbol{u_i}\) is a vector of \(d\) latent variables
  • \(\boldsymbol{u_i} \sim N(\boldsymbol{0}, \boldsymbol{I})\)
  • \(\boldsymbol{\Lambda} = \left\{\boldsymbol{\lambda}_1^\top, \ldots, \boldsymbol{\lambda}_p\right\}^\top\) are a matrix of factor loadings

This implies \(\boldsymbol{b_i} \sim N(\boldsymbol{0}, \boldsymbol{\Lambda}\boldsymbol{\Lambda'})\)

GLMM: reduced-rank model

wf.glmm <- glmmTMB(Abundance ~ Zone*Year + diag(Zone*Year|Species) +
                     (1|Station) + 
                     rr(Species + 0|ID, 2), 
                   family = "poisson", data = wf_data,
                   control = glmmTMBControl(start_method = list(method = "res")))
  • The number after the comma in the rr formula specifies the rank of the variance-covariance matrix of the multivariate random effects.

  • The start_method argument in glmmTMBControl() controls the algorithm used for initialising the rr parameters (\(\boldsymbol{\lambda}_j\) and \(\boldsymbol{u}_{i}\))

GLMM: summary

summary(wf.glmm)
##  Family: poisson  ( log )
## Formula:          
## Abundance ~ Zone * Year + diag(Zone * Year | Species) + (1 |  
##     Station) + rr(Species + 0 | ID, 2)
## Data: wf_data
## 
##      AIC      BIC   logLik deviance df.resid 
##   2914.1   3075.6  -1427.1   2854.1     1581 
## 
## Random effects:
## 
## Conditional model:
##  Groups  Name                 Variance Std.Dev. Corr       
##  Species (Intercept)          3.867064 1.96649             
##          ZoneN                0.907621 0.95269  0.00       
##          ZoneS                4.070795 2.01762  0.00  0.00 
##          Year2010             1.676825 1.29492  0.00  0.00 
##          ZoneN:Year2010       0.837671 0.91524  0.00  0.00 
##          ZoneS:Year2010       0.330832 0.57518  0.00  0.00 
##  Station (Intercept)          0.033223 0.18227             
##  ID      SpeciesTorsk         0.005312 0.07288             
##          SpeciesTanglake      0.214615 0.46327   0.91      
##          SpeciesAL            0.370474 0.60867  -0.45 -0.78
##          SpeciesOxsimpa       0.116795 0.34175  -1.00 -0.89
##          SpeciesStensnultra   0.575227 0.75844   0.58  0.19
##          SpeciesRootsimpa     0.427284 0.65367   0.83  0.53
##          SpeciesSkrubbskaadda 0.592369 0.76966   0.37 -0.06
##          SpeciesSjurygg       0.505660 0.71110   0.67  0.30
##          SpeciesSandskaadda   3.149179 1.77459   0.67  0.30
##                                      
##                                      
##                                      
##                                      
##  0.00                                
##  0.00  0.00                          
##  0.00  0.00  0.00                    
##                                      
##                                      
##                                      
##                                      
##   0.41                               
##   0.47 -0.62                         
##   0.12 -0.86  0.93                   
##   0.67 -0.41  0.97  0.82             
##   0.36 -0.70  0.99  0.97  0.94       
##   0.36 -0.71  0.99  0.97  0.93  1.00 
## Number of obs: 1611, groups:  
## Species, 9; Station, 108; ID, 179
## 
## Conditional model:
##                Estimate Std. Error z value Pr(>|z|)   
## (Intercept)    -1.95462    0.70374  -2.777  0.00548 **
## ZoneN          -0.15727    0.42628  -0.369  0.71217   
## ZoneS          -0.04281    0.73649  -0.058  0.95364   
## Year2010       -0.14088    0.51263  -0.275  0.78346   
## ZoneN:Year2010 -0.17822    0.46092  -0.387  0.69900   
## ZoneS:Year2010  0.62528    0.39447   1.585  0.11293   
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

GLMM: ordination plot

We can visualise the correlations between species from the estimated latent variables and factor loadings from a model without predictors, i.e., an unconstrained ordination plot

wf.glmm.unc <- glmmTMB(Abundance ~  Species + rr(Species + 0 | ID, d = 2),
                       family = "poisson", data = wf_data, 
                       control = glmmTMBControl(start_method = list(method = "res")  ))

To get the factor laodings and latent variables for the reduced-rank random effect:

wf.unc.fl.b <- extract.rr.par(wf.glmm.unc)
unc.wf.b <- as.data.frame(wf.unc.fl.b$b)
unc.wf.b$ID <- rownames(unc.wf.b)
unc.wf.fl <-  as.data.frame(wf.unc.fl.b$fl)

GLMM: ordination plot

cnms <- wf.glmm.unc$modelInfo$reTrms[["cond"]]$cnms
labels <- gsub("Species", "", cnms$ID)
names(unc.wf.b)[1:ncol(unc.wf.fl)] <- paste0("LV", 1:ncol(unc.wf.fl))

zone_col <- wf_data %>%
  select(ID, Zone, Year) %>%
  distinct() %>%
  mutate(ID = as.character(ID))

unc.wf.b <- left_join(unc.wf.b, zone_col, by = "ID")

ordi.plot <- ggplot(unc.wf.b) +
  geom_point(aes(x = LV1, y = LV2, color = Zone, shape = Year) , size = 2) +
  geom_text(data = unc.wf.fl, aes(x = V1, y = V2, label = labels),  colour="dark grey") +
  theme_classic() +
  labs(x = "Latent variable 1", y = "Latent variable 2")+
  scale_color_brewer(palette = "Dark2")

GLMM: ordination plot

ordi.plot

GLMM: ordination plot

We can compare the unconstrained ordination plot to a residual ordination plot, i.e., after including zone, year, their interaction and station in the model.

Random slopes model

Random slopes model example: PIRLS

  • Fit a random slopes model with heaps of slopes

  • Literacy of students across the world (PIRLS)

  • We are interested in the effect of the interaction between economic disadvantage and the size of the school library.

  • Allow the interaction to vary by country

Global literacy - PIRLS

Consider the model of overall literacy score, \(y_{ijk}\), for student \(i = 1, \ldots, n\), in school \(j = 1,\ldots, l\) and country \(k = 1, \ldots, p\) as follows:

\[ y_{ijk} = \boldsymbol{x_j'\beta} + {b_{j}^{[s]}} + \boldsymbol{x_j'b_{jk}^{[rr]}} + \epsilon_{ijk} \]

  • \(\boldsymbol{x_j}\) are vectors of covariates relating to school factors
  • \(\boldsymbol{\beta}\) are vectors of fixed coefficients
  • \({b_{j}^{[s]}} \sim N(0, \sigma^2_s)\)
  • \(\boldsymbol{b_{j}^{[rr]}} \sim N(\boldsymbol{0}, \boldsymbol{\Lambda \Lambda'})\) where \(\boldsymbol{\Lambda}\) is the full matrix of factor loadings.
  • \(\epsilon_{ijk} \sim N(0, \sigma^2)\)

PIRLS: reduced-rank model

This may take a little longer to fit

load("data/pirls_data.RData")

fit.rr <- glmmTMB(Overall ~  Size_lib*Eco_disad +
                    (1 |School) +
                    rr(Size_lib*Eco_disad | Country, 3),
                  data = pirls,
                  family = gaussian(),
                  control = glmmTMBControl(optCtrl=list(iter.max=1e5,eval.max=1e3),
                                           start_method = list(method = "res")))

PIRLS: reduced-rank estimates

PIRLS: random slopes plot

References

Brooks, Mollie E, Kasper Kristensen, Koen J van Benthem, Arni Magnusson, Casper W Berg, Anders Nielsen, Hans J Skaug, Martin Machler, and Benjamin M Bolker. 2017. “glmmTMB Balances Speed and Flexibility Among Packages for Zero-Inflated Generalized Linear Mixed Modeling.” The R Journal 9 (2): 378–400.